Zillow has realized that its housing market predictions are not as accurate as they could be because they do not factor in enough local intelligence. Our group has been selected to build a better predictive model of home prices for San Francisco. Because so many variables influence home value in San Francisco, prediction is a difficult task. Even though gathering and processing the raw data takes substantial time and effort, the challenge is an interesting one to take on.
In this project, we bring geospatial analysis and machine learning techniques together to build our model. The goal of geospatial prediction is to borrow the experience from one place and test the extent to which that experience generalizes to another place. The machine learning process is divided into four steps, each discussed briefly below.
Data wrangling: The first step of the process is to gather and compile the appropriate data, often from multiple disparate sources, into one consistent dataset. This means the analyst has to understand the nature of the data, how it was collected, and how to massage the raw data into useful predictive features.
Exploratory analysis: Thoughtful exploratory analysis often leads to more useful predictive models. It also makes the analysis more interpretable for non-technical clients.
Feature engineering: Feature engineering is what separates a great model from a good one. Features are the variables used to predict the outcome of interest - in this case home prices. The more the analyst can convert raw data into useful features, the better her model will perform. There are two key considerations for doing this well.
Feature selection: In a geospatial prediction project, the dataset may contain hundreds of candidate features. It is much wiser to select the useful ones than to include them all. Feature selection is the process of whittling down all the possible features into a concise and parsimonious set that optimizes for accuracy and generalizability.
Model estimation and validation: In this project, we develop a home price prediction algorithm using linear regression models. To avoid bias and inaccuracy, we apply several methods in the following report to validate the machine learning model for accuracy and generalizability.
Overall, the home value model we built explains roughly 85 percent of the variation in prices for homes sold from 2012 to 2015, with an average error of about 180,000 dollars. Given how difficult home price prediction is, we consider this a well-performing model.
Our first geospatial machine learning model will be trained on home price data from San Francisco - one of the largest cities in the United States, featuring some of the nation's best travel destinations and a very robust technology sector. In this section, we create a `mapTheme`, read the data, and re-project it to State Plane (EPSG 2227), a coordinate system measured in feet.
Next, we loaded the shapefile `sp_original`, which contains 10,131 house sale points from 2012 to 2015. A map of San Francisco house prices using this dataset is presented below.
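The loading and re-projection steps might look like the following sketch with the `sf` package; the file path is a hypothetical placeholder.

```r
library(sf)

# Read the home sale points (hypothetical path) and re-project them to
# California State Plane Zone III (EPSG 2227), measured in feet
sp_original <- st_read("data/sf_home_sales.shp")
sp_original <- st_transform(sp_original, crs = 2227)
```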
The dataset above also includes various characteristics of San Francisco homes that are useful for model construction, including:
`houseAge` (= 2019 - BuiltYear)

Census tract data `tracts` is downloaded using the `tigris` package, which provides powerful access to US Census Bureau data. A couple of tracts are removed from the original dataset because they fall outside the study area. Both datasets are re-projected to the appropriate State Plane.
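A sketch of the tract download is below; the year and caching option are our assumptions.

```r
library(tigris)
library(sf)

options(tigris_use_cache = TRUE)  # cache downloads locally (assumed setting)

# Census tract geometries for San Francisco County, re-projected to EPSG 2227
tracts <- tracts(state = "CA", county = "San Francisco", year = 2017)
tracts <- st_transform(tracts, crs = 2227)
```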
Various datasets are downloaded from San Francisco Open Data, which hosts data on the city's economic, environmental, social, and many other characteristics. From this website, we have downloaded:
Census data is downloaded through the `tidycensus` package, which provides access to both the decennial and the ACS census data. Here, we work with the 2017 ACS 5-year estimates at the census tract level, using both the ACS profile and the detailed tables. Variables selected from the census include:
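A sketch of this step is below. The variable IDs shown are real ACS codes but only an illustrative subset of what we used, and a Census API key is required.

```r
library(tidycensus)
# census_api_key("YOUR_KEY", install = TRUE)  # one-time setup

# 2017 ACS 5-year estimates at the tract level for San Francisco;
# the three variables here are examples, not our full list
census <- get_acs(geography = "tract",
                  variables = c(med_hhincome = "B19013_001",  # median household income
                                totalpop     = "B01003_001",  # total population
                                medianrent   = "B25064_001"), # median gross rent
                  state = "CA", county = "San Francisco",
                  year = 2017, survey = "acs5", output = "wide")
```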
Finally, the locations of the Presidio, Twin Peaks, the Golden Gate, and the University of San Francisco are manually combined into one dataset.
We have categorized our predictors into three types: internal characteristics, amenities/public services, and spatial structure. Their summary statistics are presented below. Note that a few variables, including the neighborhoods and zonings, are absent from the tables because they are spatial or categorical data.
A correlation matrix is produced to examine the relationships among the continuous variables. The non-continuous variables, including all the census data variables, are dropped. To further simplify the correlation plot, only one variable is kept for each amenity.
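A sketch of how such a matrix can be produced is below; the column selection is illustrative.

```r
library(dplyr)
library(sf)
library(corrplot)

# Drop geometry, keep a handful of continuous variables
# (one per amenity), and plot their pairwise correlations
num_vars <- sp_original %>%
  st_drop_geometry() %>%
  select(SalePrice, PropArea, LotArea, houseAge,
         Financ_nn5, ind_nn3, tech_nn5)

corrplot(cor(num_vars, use = "pairwise.complete.obs"), method = "color")
```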
A series of explorations of the correlation between the dependent variable (sale price) and the other predictors is conducted using scatter plots. Here, the four best-fitting scatter plots are presented: `Fianc_nn5` is the distance to the 5 nearest financial services, `ind_nn3` is the distance to the 3 nearest attractions, `pcsamehouse` is the percentage of residents who have lived in the same house for over 1 year, and `tech_nn5` is the distance to the 5 nearest tech companies.
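These `_nn` features can be built by averaging the distance from each home to its k nearest amenities. Below is a sketch using `FNN::get.knnx`; the helper name and the `financial_services` layer are our own illustrative inputs.

```r
library(FNN)
library(sf)

# Average distance from each query point to its k nearest target points
nn_distance <- function(from_xy, to_xy, k) {
  nn <- get.knnx(data = to_xy, query = from_xy, k = k)
  rowMeans(nn$nn.dist)
}

# e.g., distance to the 5 nearest financial services (coordinates in feet);
# `financial_services` is a hypothetical sf layer of amenity points
sp_original$Financ_nn5 <- nn_distance(st_coordinates(sp_original),
                                      st_coordinates(financial_services), 5)
```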
Three of the independent variables are mapped below. For the first two variables, both the density of the amenity and the distance to the 5 nearest amenities are mapped. For the third variable, elevation, both the point elevations and the contour lines are mapped.
In this project, we use Ordinary Least Squares (OLS) regression to estimate house prices. OLS finds the linear relationship between the dependent variable - in this case the house sale price - and the predictors, which are the variables described in the data collection section above. The feature engineering is done in the following steps.
In order to construct and choose the most effective features, we first ran an OLS regression on all the predictors to test the fit of the model. The output gives both the p-value and the coefficient for each variable. The p-value reflects the significance of the coefficient - that is, the probability of observing a coefficient this large if there were truly no relationship between the independent and dependent variables. Typically, we want the p-value to be smaller than 0.05. R-squared is also calculated; it measures the share of variance in the dependent variable that is explained by the predictors, so a larger R-squared indicates a better-fitting model.
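A sketch of this first regression is below; the formula lists only a few representative predictors, not our full specification.

```r
library(sf)

# Baseline OLS on the (log-transformed) sale price; the full model
# includes many more predictors than shown here
reg_all <- lm(log(SalePrice) ~ PropArea + LotArea + houseAge +
                SaleYr + Financ_nn5 + nbrhood,
              data = st_drop_geometry(sp_original))

summary(reg_all)  # coefficients, p-values, and R-squared
```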
We compare the p-values of the variables to check their significance in relation to house price. This suggests which variables to keep or change. After removing or adding variables, we run the OLS model again and compare the R-squared values. Multiple trials are done to select the best set of predictors.
A series of feature engineering steps is applied throughout the trial process. We log-transformed the sale price and the price-related predictors to make them better fit the linear model. We also converted several variables into categories, since continuous variables do not always have the most predictive power. For example, from a demand perspective, a potential mansion purchaser does not care about the exact number of rooms, so whether a house has 6 or 7 bedrooms may not influence the final price much.
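For instance, room counts can be binned along these lines; the cut points are illustrative approximations of the categories that appear in the regression table, and `Rooms` is the assumed name of the raw column.

```r
library(dplyr)

# Bin the continuous room count into the broad ranges used in the model
sp_original <- sp_original %>%
  mutate(Rooms_cat = cut(Rooms,
                         breaks = c(-Inf, 0, 3, 7, 9, 12),
                         labels = c("0", "1-3", "4-7", "7-9", "10-12")))
```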
We also used stepwise regression via the `stepAIC` function in the `MASS` package, which automatically searches for the set of predictors with the lowest AIC. However, since we perform more model validation in the next step, we used this function's result only as a reference.
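A sketch, continuing from the `reg_all` fit above:

```r
library(MASS)

# Stepwise search in both directions, minimizing AIC;
# we treat the selected formula only as a reference
step_fit <- stepAIC(reg_all, direction = "both", trace = FALSE)
summary(step_fit)
```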
It is critical that models generalize to data they have not seen before. The R-squared above measures error only on the data the model was trained on. Below, the data is split into training and test datasets; models are trained on the former and tested on the latter.
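One way to make such a split, sketched with `caret` (the seed is arbitrary):

```r
library(caret)

set.seed(825)  # arbitrary seed for reproducibility

# 60% training / 40% test split, stratified on sale price
inTrain  <- createDataPartition(y = sp_original$SalePrice,
                                p = 0.60, list = FALSE)
training <- sp_original[inTrain, ]
test     <- sp_original[-inTrain, ]
```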
As mentioned above, the total dataset is divided into `training` and `test` sets, representing 60% and 40% of the data respectively. The table below describes the in-sample (training set) model results. We include all the variables in the training set so that it is easy to see which variables have the most significant influence. The Estimate column gives the coefficient; the next three columns are the standard error, t-value, and p-value. Scrolling through the table makes it easier to compare the variables.
 | Estimate | Std. Error | t-value | p-value |
---|---|---|---|---|
(Intercept) | 4.2639562 | 0.6616621 | 6.4443106 | 0.0000000 |
PropClassCD | 0.5755506 | 0.1481561 | 3.8847586 | 0.0001036 |
PropClassCDA | 0.4618116 | 0.1559621 | 2.9610491 | 0.0030786 |
PropClassCF | 0.5390863 | 0.1514164 | 3.5602904 | 0.0003735 |
PropClassCLZ | 0.4236747 | 0.1603939 | 2.6414631 | 0.0082775 |
PropClassCOZ | 0.0765424 | 0.2709996 | 0.2824445 | 0.7776130 |
PropClassCTH | 0.3770257 | 0.1633152 | 2.3085764 | 0.0210029 |
PropClassCTIC | -0.1104511 | 0.1507787 | -0.7325379 | 0.4638705 |
PropClassCZ | 0.3901430 | 0.1485135 | 2.6269861 | 0.0086376 |
PropClassCZBM | -0.0865445 | 0.1647161 | -0.5254164 | 0.5993143 |
PropClassCZEU | 1.0460826 | 0.2574477 | 4.0632822 | 0.0000490 |
LotArea | 0.0000001 | 0.0000000 | 4.2285058 | 0.0000239 |
PropArea | 0.0001913 | 0.0000073 | 26.2462328 | 0.0000000 |
Stories | 0.0121555 | 0.0051416 | 2.3641616 | 0.0181044 |
Rooms_cat0 | 0.2359082 | 0.0392759 | 6.0064410 | 0.0000000 |
Rooms_cat1-3 | -0.0148150 | 0.0413008 | -0.3587109 | 0.7198247 |
Rooms_cat10-12 | 0.1495376 | 0.0329817 | 4.5339511 | 0.0000059 |
Rooms_cat4-7 | 0.1787997 | 0.0345255 | 5.1787739 | 0.0000002 |
Rooms_cat7-9 | 0.1735513 | 0.0329259 | 5.2709730 | 0.0000001 |
Beds_cat1-2 | -0.0151342 | 0.0081737 | -1.8515878 | 0.0641370 |
Beds_cat3-5 | 0.0296502 | 0.0077134 | 3.8439851 | 0.0001224 |
Beds_cat6- | -0.0832438 | 0.0305800 | -2.7221647 | 0.0065054 |
Baths | 0.0089753 | 0.0045971 | 1.9523729 | 0.0509431 |
SaleYr | 0.1311513 | 0.0025949 | 50.5411144 | 0.0000000 |
crime_nn5 | -0.0000147 | 0.0000285 | -0.5174862 | 0.6048369 |
school_nn3 | -0.0000127 | 0.0000088 | -1.4452189 | 0.1484516 |
park_nn1 | -0.0000029 | 0.0000077 | -0.3721269 | 0.7098123 |
Distance_N | -0.0000058 | 0.0000055 | -1.0501808 | 0.2936797 |
Distance.y | -0.0000082 | 0.0000088 | -0.9266685 | 0.3541381 |
totaltree | 0.0000000 | 0.0000000 | 1.1994673 | 0.2303964 |
transit_cat100-150 | 0.0599836 | 0.0573878 | 1.0452324 | 0.2959601 |
transit_cat150-200 | 0.0861039 | 0.0557840 | 1.5435237 | 0.1227595 |
transit_cat200-250 | 0.0855872 | 0.0543222 | 1.5755482 | 0.1151857 |
transit_cat250-300 | 0.0875179 | 0.0537792 | 1.6273553 | 0.1037172 |
transit_cat300-350 | 0.0803336 | 0.0533225 | 1.5065603 | 0.1319791 |
transit_cat350-400 | 0.0995047 | 0.0536454 | 1.8548610 | 0.0636679 |
transit_cat400-450 | 0.0329856 | 0.0541271 | 0.6094109 | 0.5422765 |
transit_cat50-100 | 0.1033631 | 0.0643256 | 1.6068725 | 0.1081380 |
hght_cat<50 | 0.0465552 | 0.0743450 | 0.6262044 | 0.5312061 |
hght_cat>100 | 0.0020995 | 0.0785863 | 0.0267158 | 0.9786874 |
elev.cat1>700 | -0.0069740 | 0.0493066 | -0.1414417 | 0.8875260 |
elev.cat1100-150 | -0.0213546 | 0.0177310 | -1.2043673 | 0.2284979 |
elev.cat1150-200 | 0.0018289 | 0.0187736 | 0.0974185 | 0.9223975 |
elev.cat1200-250 | -0.0168260 | 0.0194673 | -0.8643183 | 0.3874496 |
elev.cat1250-300 | -0.0036226 | 0.0206547 | -0.1753887 | 0.8607805 |
elev.cat1300-350 | -0.0142960 | 0.0217078 | -0.6585648 | 0.5102020 |
elev.cat1350-400 | -0.0236293 | 0.0239669 | -0.9859120 | 0.3242183 |
elev.cat1400-450 | 0.0017987 | 0.0257546 | 0.0698416 | 0.9443222 |
elev.cat1450-500 | -0.0191671 | 0.0303781 | -0.6309509 | 0.5280980 |
elev.cat150~100 | -0.0215976 | 0.0151805 | -1.4227154 | 0.1548736 |
elev.cat1500-550 | -0.0462475 | 0.0344019 | -1.3443309 | 0.1788951 |
elev.cat1550-600 | -0.0199127 | 0.0385885 | -0.5160265 | 0.6058560 |
elev.cat1600-650 | 0.0093565 | 0.0405282 | 0.2308637 | 0.8174290 |
elev.cat1650-700 | -0.0673372 | 0.0462339 | -1.4564458 | 0.1453248 |
zoning_simRH-1 | -0.0104756 | 0.0152618 | -0.6863949 | 0.4924922 |
zoning_simRH-1(D) | 0.0073797 | 0.0216408 | 0.3410112 | 0.7331078 |
zoning_simRH-23 | -0.0056301 | 0.0135150 | -0.4165862 | 0.6769969 |
med_hhincome_ln | 0.0424431 | 0.0376544 | 1.1271756 | 0.2597159 |
pcWhite | -0.1031036 | 0.1075948 | -0.9582583 | 0.3379734 |
pcBlack | -0.4778166 | 0.1632112 | -2.9275970 | 0.0034295 |
pcAsian | -0.2010922 | 0.1152188 | -1.7453079 | 0.0809854 |
pcHispanic | -0.2882608 | 0.1115376 | -2.5844265 | 0.0097789 |
pcbachedegree | 0.0004485 | 0.0009339 | 0.4802262 | 0.6310851 |
houseAge | 0.0000693 | 0.0000153 | 4.5405581 | 0.0000057 |
pcunder18 | 0.0025540 | 0.0016514 | 1.5465598 | 0.1220252 |
pcabove65 | 0.0020981 | 0.0015027 | 1.3962367 | 0.1626978 |
pc2vehicles | -0.0026315 | 0.0008696 | -3.0260450 | 0.0024889 |
totalpop | 0.0000059 | 0.0000030 | 1.9608177 | 0.0499491 |
pc3ormorevehicles | -0.0020078 | 0.0011340 | -1.7704616 | 0.0767040 |
pcrenter | 0.0000642 | 0.0006946 | 0.0924928 | 0.9263098 |
pcdetached | 0.0003550 | 0.0005725 | 0.6200595 | 0.5352435 |
pcvacant | -0.0023714 | 0.0013304 | -1.7824454 | 0.0747301 |
pcsamehouse | -0.0006046 | 0.0011584 | -0.5219337 | 0.6017369 |
medianhomevalue_ln | 0.0534132 | 0.0403019 | 1.3253265 | 0.1851162 |
medianrent_ln | -0.0171124 | 0.0204715 | -0.8359132 | 0.4032389 |
water | -0.0000038 | 0.0000033 | -1.1610185 | 0.2456832 |
Pr_hlth_nn5 | 0.0000107 | 0.0000105 | 1.0151693 | 0.3100684 |
Financ_nn5 | -0.0000303 | 0.0000081 | -3.7224370 | 0.0001992 |
ind_nn3 | -0.0000013 | 0.0000033 | -0.3888641 | 0.6973912 |
nbrhoodAnza Vista | 0.0575581 | 0.0989783 | 0.5815227 | 0.5609113 |
nbrhoodBalboa Terrace | -0.3414999 | 0.0897596 | -3.8046055 | 0.0001435 |
nbrhoodBayview | -0.4190595 | 0.0853798 | -4.9081799 | 0.0000009 |
nbrhoodBayview Heights | -0.3519784 | 0.0922180 | -3.8168089 | 0.0001366 |
nbrhoodBernal Heights | -0.1443371 | 0.0727673 | -1.9835438 | 0.0473548 |
nbrhoodBuena Vista Park/Ashbury Heights | -0.0781552 | 0.0682499 | -1.1451333 | 0.2522022 |
nbrhoodCandlestick Point | 0.0188987 | 0.1155143 | 0.1636047 | 0.8700482 |
nbrhoodCentral Richmond | -0.2476063 | 0.0714027 | -3.4677434 | 0.0005287 |
nbrhoodCentral Sunset | -0.2857717 | 0.0731820 | -3.9049441 | 0.0000953 |
nbrhoodCentral Waterfront/Dogpatch | -0.0296298 | 0.1217184 | -0.2434289 | 0.8076819 |
nbrhoodClarendon Heights | -0.0140666 | 0.0842719 | -0.1669192 | 0.8674396 |
nbrhoodCole Valley/Parnassus Heights | -0.0698871 | 0.0704930 | -0.9914047 | 0.3215303 |
nbrhoodCorona Heights | -0.0431005 | 0.0706445 | -0.6101044 | 0.5418171 |
nbrhoodCow Hollow | 0.0770465 | 0.0739473 | 1.0419113 | 0.2974972 |
nbrhoodCrocker Amazon | -0.3186373 | 0.0849190 | -3.7522508 | 0.0001770 |
nbrhoodDiamond Heights | -0.2154483 | 0.0766903 | -2.8093279 | 0.0049815 |
nbrhoodDowntown | 0.0040233 | 0.1276963 | 0.0315070 | 0.9748663 |
nbrhoodDuboce Triangle | 0.0077086 | 0.0821412 | 0.0938451 | 0.9252355 |
nbrhoodEureka Valley / Dolores Heights | -0.0228854 | 0.0656139 | -0.3487884 | 0.7272611 |
nbrhoodExcelsior | -0.3493096 | 0.0807957 | -4.3233689 | 0.0000156 |
nbrhoodFinancial District/Barbary Coast | 0.0008044 | 0.1696964 | 0.0047400 | 0.9962182 |
nbrhoodForest Hill | -0.2138032 | 0.0789687 | -2.7074404 | 0.0068007 |
nbrhoodForest Hills Extension | -0.2356711 | 0.0839459 | -2.8074169 | 0.0050111 |
nbrhoodForest Knolls | -0.3230069 | 0.0822533 | -3.9269780 | 0.0000870 |
nbrhoodGlen Park | -0.1629115 | 0.0715477 | -2.2769636 | 0.0228255 |
nbrhoodGolden Gate Heights | -0.2383626 | 0.0766724 | -3.1088444 | 0.0018875 |
nbrhoodHaight Ashbury | -0.0071368 | 0.0740746 | -0.0963458 | 0.9232494 |
nbrhoodHayes Valley | -0.0292133 | 0.0675228 | -0.4326439 | 0.6652899 |
nbrhoodHunters Point | -0.4161678 | 0.1034729 | -4.0219975 | 0.0000585 |
nbrhoodIngleside | -0.3114883 | 0.0784659 | -3.9697294 | 0.0000728 |
nbrhoodIngleside Heights | -0.3169694 | 0.0818353 | -3.8732598 | 0.0001086 |
nbrhoodIngleside Terrace | -0.3457296 | 0.0841945 | -4.1063214 | 0.0000408 |
nbrhoodInner Mission | -0.1027365 | 0.0719483 | -1.4279201 | 0.1533699 |
nbrhoodInner Parkside | -0.2817343 | 0.0734866 | -3.8338199 | 0.0001275 |
nbrhoodInner Richmond | -0.1557548 | 0.0724680 | -2.1492916 | 0.0316534 |
nbrhoodInner Sunset | -0.2237886 | 0.0705252 | -3.1731716 | 0.0015159 |
nbrhoodJordan Park / Laurel Heights | -0.0455100 | 0.0762639 | -0.5967443 | 0.5507019 |
nbrhoodLake Shore | -0.2345527 | 0.0864963 | -2.7117077 | 0.0067139 |
nbrhoodLake Street | -0.1125872 | 0.0746021 | -1.5091701 | 0.1313110 |
nbrhoodLakeside | -0.3057071 | 0.1113176 | -2.7462609 | 0.0060469 |
nbrhoodLincoln Park | -0.0699205 | 0.1407764 | -0.4966777 | 0.6194356 |
nbrhoodLittle Hollywood | -0.1886448 | 0.1049682 | -1.7971606 | 0.0723633 |
nbrhoodLone Mountain | -0.1296185 | 0.0747212 | -1.7346960 | 0.0828490 |
nbrhoodLower Pacific Heights | -0.1179994 | 0.0697430 | -1.6919167 | 0.0907168 |
nbrhoodMarina | 0.0286487 | 0.0760877 | 0.3765217 | 0.7065431 |
nbrhoodMerced Heights | -0.2522350 | 0.0810730 | -3.1112076 | 0.0018725 |
nbrhoodMerced Manor | -0.3212896 | 0.0938382 | -3.4238668 | 0.0006217 |
nbrhoodMidtown Terrace | -0.2482746 | 0.0816748 | -3.0397935 | 0.0023782 |
nbrhoodMiraloma Park | -0.2631985 | 0.0773026 | -3.4047819 | 0.0006668 |
nbrhoodMission Bay | 0.4323335 | 0.2288954 | 1.8887821 | 0.0589720 |
nbrhoodMission Dolores | -0.0185553 | 0.0765687 | -0.2423348 | 0.8085295 |
nbrhoodMission Terrace | -0.3109898 | 0.0778739 | -3.9935057 | 0.0000659 |
nbrhoodMonterey Heights | -0.2483566 | 0.0876482 | -2.8335638 | 0.0046195 |
nbrhoodMount Davidson Manor | -0.3192811 | 0.0829064 | -3.8511031 | 0.0001189 |
nbrhoodNob Hill | 0.0943596 | 0.1023746 | 0.9217098 | 0.3567191 |
nbrhoodNoe Valley | -0.0380521 | 0.0665507 | -0.5717762 | 0.5674962 |
nbrhoodNorth Beach | 0.1127144 | 0.1156069 | 0.9749800 | 0.3296117 |
nbrhoodNorth Panhandle | -0.0508552 | 0.0687353 | -0.7398704 | 0.4594092 |
nbrhoodNorth Waterfront | 0.0618623 | 0.1681141 | 0.3679781 | 0.7129033 |
nbrhoodOceanview | -0.3301454 | 0.0820171 | -4.0253255 | 0.0000576 |
nbrhoodOuter Mission | -0.3500469 | 0.0825430 | -4.2407807 | 0.0000226 |
nbrhoodOuter Parkside | -0.2963253 | 0.0761693 | -3.8903536 | 0.0001012 |
nbrhoodOuter Richmond | -0.2682177 | 0.0749564 | -3.5783159 | 0.0003487 |
nbrhoodOuter Sunset | -0.2742651 | 0.0752594 | -3.6442667 | 0.0002706 |
nbrhoodPacific Heights | 0.1322949 | 0.0690104 | 1.9170302 | 0.0552843 |
nbrhoodParkside | -0.2863582 | 0.0744282 | -3.8474445 | 0.0001207 |
nbrhoodPine Lake Park | -0.3217252 | 0.0925085 | -3.4777922 | 0.0005094 |
nbrhoodPortola | -0.3293526 | 0.0834588 | -3.9462880 | 0.0000803 |
nbrhoodPotrero Hill | -0.1122644 | 0.0736244 | -1.5248256 | 0.1273583 |
nbrhoodPresidio Heights | 0.0164464 | 0.0785913 | 0.2092651 | 0.8342488 |
nbrhoodRussian Hill | 0.1100434 | 0.0784759 | 1.4022566 | 0.1608933 |
nbrhoodSaint Francis Wood | -0.1263398 | 0.0789991 | -1.5992562 | 0.1098194 |
nbrhoodSea Cliff | -0.0233098 | 0.0878560 | -0.2653179 | 0.7907742 |
nbrhoodSherwood Forest | -0.1980640 | 0.0974260 | -2.0329689 | 0.0421021 |
nbrhoodSilver Terrace | -0.3218188 | 0.0857195 | -3.7543226 | 0.0001755 |
nbrhoodSouth Beach | 0.1648346 | 0.1162110 | 1.4184086 | 0.1561264 |
nbrhoodSouth of Market | -0.0424642 | 0.0802069 | -0.5294325 | 0.5965262 |
nbrhoodStonestown | 0.0271216 | 0.1517653 | 0.1787076 | 0.8581737 |
nbrhoodSunnyside | -0.2499599 | 0.0733553 | -3.4075225 | 0.0006601 |
nbrhoodTelegraph Hill | 0.0409399 | 0.0875861 | 0.4674251 | 0.6402137 |
nbrhoodTenderloin | 0.5418892 | 0.3081128 | 1.7587362 | 0.0786762 |
nbrhoodTwin Peaks | -0.0780165 | 0.0807414 | -0.9662522 | 0.3339592 |
nbrhoodVan Ness/Civic Center | -0.1093575 | 0.1655360 | -0.6606269 | 0.5088784 |
nbrhoodVisitacion Valley | -0.3092418 | 0.0883157 | -3.5015479 | 0.0004661 |
nbrhoodWest Portal | -0.1892915 | 0.0786002 | -2.4082837 | 0.0160594 |
nbrhoodWestern Addition | -0.1499799 | 0.0863862 | -1.7361556 | 0.0825906 |
nbrhoodWestwood Highlands | -0.2498507 | 0.0872422 | -2.8638755 | 0.0042003 |
nbrhoodWestwood Park | -0.2045752 | 0.1110738 | -1.8417948 | 0.0655573 |
nbrhoodYerba Buena | 0.0444113 | 0.1848998 | 0.2401913 | 0.8101906 |
lagPrice_ln | -0.1988308 | 0.0164408 | -12.0937309 | 0.0000000 |
SalePrice.buffv_ln | 0.6296930 | 0.0171373 | 36.7439033 | 0.0000000 |
Sample | R-squared | Adjusted R-squared |
---|---|---|
Training | 0.8576808 | 0.8534425 |
When we evaluate the model on the test set, which holds the remaining 40% of the data, the predictions still show some inaccuracy. The graphic below compares the actual sale prices with the predicted sale prices.
Model | Mean Absolute Error ($) | Mean Absolute Percent Error (%) |
---|---|---|
test | 178002.6 | 15.74407 |
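These summary errors can be computed along these lines; `SalePrice.Predict` is a hypothetical name for the predicted-price column.

```r
library(dplyr)
library(sf)

# Absolute error and absolute percent error for each test-set sale
test_errors <- test %>%
  st_drop_geometry() %>%
  mutate(AbsError = abs(SalePrice.Predict - SalePrice),
         APE      = AbsError / SalePrice)

summarize(test_errors,
          mean_AbsError    = mean(AbsError, na.rm = TRUE),
          mean_APE_percent = 100 * mean(APE, na.rm = TRUE))
```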
Another diagram illustrates the distribution of prediction errors in the test set. The majority fall within the 0-200,000 dollar absolute error range, while some exceed 400,000 dollars.
Estimating a model on the training set and predicting for the test set is a good way to understand how well the model might generalize to homes that have not actually sold. However, to further assess generalizability, we perform cross-validation, which re-fits and evaluates the model across 100 folds of the data. Cross-validation allows one to judge generalizability not on one random hold-out but on many, helping to ensure that the goodness of fit on a single hold-out is not a fluke.
Source: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation
Below is the result of our cross-validation model. We did not use the log-transformed price in this model because it is difficult to transform predictions back to actual prices within cross-validation. In this case, the MAE is a little higher: about 204,000 dollars compared with 178,000 dollars before.
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 299118.1 0.8165478 203908.9 38288.06 0.04399277 18957.19
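Output like the above can be produced with a setup along the following lines; this is a sketch, and the formula shown differs from our final specification.

```r
library(caret)
library(sf)

# 100-fold cross-validation on the raw (untransformed) sale price
fitControl <- trainControl(method = "cv", number = 100)

cv_fit <- train(SalePrice ~ PropArea + LotArea + houseAge + SaleYr +
                  Financ_nn5 + nbrhood,
                data = st_drop_geometry(sp_original),
                method = "lm", trControl = fitControl,
                na.action = na.omit)

cv_fit$results  # RMSE, R-squared, MAE and their SDs across folds
```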
When we plot the predicted against the actual sale prices with a linear fit, it is obvious that some errors remain.
The two maps below illustrate the test set sale prices and the absolute sale price errors. Larger errors tend to occur in the central and northern parts of San Francisco, which are the higher-value housing areas.
The Moran's I statistic is a statistical hypothesis test that asks whether, and to what extent, a spatial phenomenon exhibits a given spatial process. In practice, the test looks at how local means deviate from the global mean. A positive Moran's I approaching 1 describes positive spatial autocorrelation, also known as clustering. Instances where high and low prices "repel" one another are said to be dispersed. Finally, where positive and negative values are randomly distributed, the Moran's I statistic is 0.
In our model, the Moran's I of the prediction errors is about 0.02, which is close to 0. In other words, our model's errors are essentially randomly distributed in space.
##
## Monte-Carlo simulation of Moran I
##
## data: filter(sp.test, !is.na(SalePrice.Error))$SalePrice.Error
## weights: spatialWeights.test
## number of simulations + 1: 1000
##
## statistic = 0.016313, observed rank = 944, p-value = 0.056
## alternative hypothesis: greater
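A sketch of how this test can be run with `spdep`, assuming spatial weights built from each sale's 5 nearest neighbors (the choice of k is our assumption):

```r
library(spdep)
library(dplyr)
library(sf)

# Drop sales with missing errors so values and weights line up
sp.test.clean <- filter(sp.test, !is.na(SalePrice.Error))

# Row-standardized spatial weights from the 5 nearest neighboring sales
coords <- st_coordinates(sp.test.clean)
spatialWeights.test <- nb2listw(knn2nb(knearneigh(coords, k = 5)),
                                style = "W")

# Monte-Carlo Moran's I with 999 permutations (simulations + 1 = 1000)
moran.mc(sp.test.clean$SalePrice.Error, spatialWeights.test, nsim = 999)
```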
Another map provides the predicted values for the entire dataset. Our model is fairly consistent across neighborhoods, so we can conclude that it is generalizable to some extent. However, larger errors occur when we predict for wealthier communities.
Finally, when we divide the data into income groups to test generalizability, we find the same result as discussed above: higher-income areas have lower prediction accuracy.
The MAPE is about 16%, meaning that on average our model deviates from the actual price by only 16%, which makes it quite effective for house price estimation.
Feature engineering and selection are very important steps in our model construction. There are three types of features in our models:
1. Spatial structure: One of the most important predictive features is the spatial lag variable, which takes the mean price of neighboring house sales within a 1/16-mile buffer of each sale point (see the sketch after this list). In this way, the model captures spatial autocorrelation very well. Including this predictor significantly improved the model's predictions.
2. Amenities: Other important variables include the distances to different types of services, including financial services and real estate leasing services.
3. Internal characteristics: Property area is the factor most strongly correlated with home price.
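A sketch of how the spatial lag can be computed with `sf` (1/16 mile = 330 feet; the data are in State Plane feet, and this implementation is one possible approach, not necessarily our exact code):

```r
library(sf)

buffer_ft <- 5280 / 16  # 1/16 mile, in feet

# Indices of all sales within the buffer distance of each sale
neighbors <- st_is_within_distance(sp_original, sp_original,
                                   dist = buffer_ft)

# Mean neighboring price, excluding the sale itself
sp_original$lagPrice <- sapply(seq_along(neighbors), function(i) {
  idx <- setdiff(neighbors[[i]], i)
  if (length(idx) == 0) NA else mean(sp_original$SalePrice[idx])
})
```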
The MAPE map demonstrates how well spatial autocorrelation is accounted for in our model. The darker blue regions, where house prices are higher, tend to have higher MAPE, meaning our model does not predict higher-priced homes well. In contrast, the model predicts regions with lower house prices much better, with a MAPE of about 7% - in other words, our predictions there deviate from the true prices by only 7%. We believe this spatial variation in MAPE arises because the prices of high-value homes fluctuate more and are influenced by more complicated factors.
We would highly recommend our model to Zillow, as it has a very small error in predicting house prices. Even though it is less representative in higher-value communities, the model still achieves strong MAE, MAPE, and R-squared results. As discussed above, the model can be improved in multiple ways; additionally, non-linear models other than OLS could be considered for a better fit.